performance issue
PerfBench: Can Agents Resolve Real-World Performance Bugs?
Garg, Spandan, Moghaddam, Roshanak Zilouchian, Sundaresan, Neel
Performance bugs are inefficiencies in software that waste computational resources without causing functional failures, making them particularly challenging to detect and fix. While recent advances in Software Engineering agents have shown promise in automated bug fixing, existing benchmarks primarily focus on functional correctness and fail to evaluate agents' abilities to identify and resolve non-functional issues like performance bugs. We introduce PerfBench, a benchmark comprising 81 real-world performance bug-fixing tasks from popular .NET repositories on GitHub. Unlike existing benchmarks that rely on pre-existing test suites, PerfBench features a novel evaluation harness that allows agents to generate their own performance benchmarks and validates fixes by comparing execution metrics collected for the developer fix and the agent fix. Each task in PerfBench is derived from actual developer fixes linked to performance-related issues, which are then verified by human experts, ensuring real-world relevance. Our evaluation reveals that current state-of-the-art coding agents struggle with performance optimization tasks, with the baseline OpenHands agent achieving only a ~3% success rate on our benchmark. We develop OpenHands-Perf-Agent, which incorporates performance-aware tooling and instructions and achieves a ~20% success rate on the benchmark. We show that by ensuring the agent has proper instructions to benchmark its changes and tooling for benchmark output processing, we can improve agent performance significantly, but room for improvement still remains. PerfBench provides a challenging test set for furthering the capabilities of agents in fixing performance issues.
- Europe > Switzerland > Zürich > Zürich (0.14)
- North America > United States > New York > New York County > New York City (0.05)
- North America > United States > Washington > King County > Redmond (0.04)
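The PerfBench abstract above describes validating a fix by comparing execution metrics collected for the developer fix and the agent fix. A minimal sketch of that comparison follows. The metric fields (`mean_time_ns`, `allocated_bytes`), the 10% tolerance, and the pass criterion are all assumptions made for illustration; the abstract does not state the harness's actual metrics or thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkMetrics:
    """Metrics from one benchmark run (hypothetical field names)."""
    mean_time_ns: float
    allocated_bytes: float


def agent_fix_matches_developer(agent: BenchmarkMetrics,
                                developer: BenchmarkMetrics,
                                tolerance: float = 0.10) -> bool:
    """Assumed criterion: the agent fix passes if it is at most
    `tolerance` worse than the developer fix on every metric."""
    time_ok = agent.mean_time_ns <= developer.mean_time_ns * (1 + tolerance)
    memory_ok = agent.allocated_bytes <= developer.allocated_bytes * (1 + tolerance)
    return time_ok and memory_ok


# Made-up numbers, for illustration only.
developer = BenchmarkMetrics(mean_time_ns=1_000.0, allocated_bytes=512.0)
fast_agent = BenchmarkMetrics(mean_time_ns=1_050.0, allocated_bytes=512.0)
slow_agent = BenchmarkMetrics(mean_time_ns=1_500.0, allocated_bytes=512.0)

print(agent_fix_matches_developer(fast_agent, developer))
print(agent_fix_matches_developer(slow_agent, developer))
```

Comparing against the developer fix rather than a fixed threshold keeps the check relative to what a human achieved on the same task, which is the idea the abstract describes.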
Mycroft: Tracing Dependencies in Collective Communication Towards Reliable LLM Training
Deng, Yangtao, Zhang, Lei, Wang, Qinlong, Zhi, Xiaoyun, Zhang, Xinlei, Jiang, Zhuo, Xu, Haohan, Wang, Lei, Song, Zuquan, Liu, Gaohong, Bai, Yang, Wang, Shuguang, Xiao, Wencong, Ye, Jianxi, Yu, Minlan, Xu, Hong
Reliability is essential for ensuring efficiency in LLM training. However, many real-world reliability issues remain difficult to resolve, resulting in wasted resources and degraded model performance. Unfortunately, today's collective communication libraries operate as black boxes, hiding critical information needed for effective root cause analysis. We propose Mycroft, a lightweight distributed tracing and root cause analysis system designed to address previously hidden reliability issues in collective communication. Mycroft's key idea is to trace collective communication states and leverage internal control and data dependencies to resolve reliability problems in LLM training. Mycroft has been deployed at ByteDance for over six months to debug collective communication related issues at runtime. It detected anomalies within 15 seconds in 90% of cases and identified the root cause within 20 seconds in 60% of cases. We also conducted extensive fault injection experiments to demonstrate Mycroft's capability and efficiency.
PerfTracker: Online Performance Troubleshooting for Large-scale Model Training in Production
Guan, Yu, Yin, Zhiyu, Chen, Haoyu, Cheng, Sheng, Yang, Chaojie, Qian, Kun, Xu, Tianyin, Zhang, Yang, Zhao, Hanyu, Li, Yong, Lin, Wei, Cai, Dennis, Zhai, Ennan
Troubleshooting performance problems of large model training (LMT) is immensely challenging, due to unprecedented scales of modern GPU clusters, the complexity of software-hardware interactions, and the data intensity of the training process. Existing troubleshooting approaches designed for traditional distributed systems or datacenter networks fall short and can hardly apply to real-world training systems. In this paper, we present PerfTracker, the first online troubleshooting system utilizing fine-grained profiling, to diagnose performance issues of large-scale model training in production. PerfTracker can diagnose performance issues rooted in both hardware (e.g., GPUs and their interconnects) and software (e.g., Python functions and GPU operations). It scales to LMT on modern GPU clusters. PerfTracker effectively summarizes runtime behavior patterns of fine-grained LMT functions via online profiling, and leverages differential observability to localize the root cause with minimal production impact. PerfTracker has been deployed as a production service for large-scale GPU clusters of O(10,000) GPUs (product homepage https://help.aliyun.com/zh/pai/user-guide/perftracker-online-performance-analysis-diagnostic-tool). It has been used to diagnose a variety of difficult performance issues.
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Washington > King County > Renton (0.04)
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
- North America > United States > California > Santa Clara County > Santa Clara (0.04)
What It Takes for A.I. to Ruin Google's Weekend
This article is from Big Technology, a newsletter by Alex Kantrowitz. On a Saturday night in mid-September, a senior Google engineer shared some rough news with more than 50 colleagues. Part of the company's cloud services offering was failing Anthropic, a darling A.I. startup and key strategic customer, and they'd have to work overtime to fix it. To repair the faulty part of its service--an underperforming and unstable NVIDIA H100 cluster--Google Cloud leadership initiated a seven-day-per-week sprint for the next month. The downside of not making it work, the senior engineer said, was "too large, for Anthropic (most importantly), for Google Cloud, and for Google," according to documents I reviewed.
Performance Issue Identification in Cloud Systems with Relational-Temporal Anomaly Detection
Gu, Wenwei, Liu, Jinyang, Chen, Zhuangbin, Zhang, Jianping, Su, Yuxin, Gu, Jiazhen, Feng, Cong, Yang, Zengyin, Lyu, Michael
Performance issues permeate large-scale cloud service systems, which can lead to huge revenue losses. To ensure reliable performance, it's essential to accurately identify and localize these issues using service monitoring metrics. Given the complexity and scale of modern cloud systems, this task can be challenging and may require extensive expertise and resources beyond the capacity of individual humans. Some existing methods tackle this problem by analyzing each metric independently to detect anomalies. However, this could incur overwhelming alert storms that are difficult for engineers to diagnose manually. To pursue better performance, not only the temporal patterns of metrics but also the correlation between metrics (i.e., relational patterns) should be considered, which can be formulated as a multivariate metrics anomaly detection problem. However, most of the studies fall short of extracting these two types of features explicitly. Moreover, there exist some unlabeled anomalies mixed in the training data, which may hinder the detection performance. To address these limitations, we propose the Relational-Temporal Anomaly Detection Model (RTAnomaly) that combines the relational and temporal information of metrics. RTAnomaly employs a graph attention layer to learn the dependencies among metrics, which will further help pinpoint the anomalous metrics that may cause the anomaly effectively. In addition, we exploit the concept of positive unlabeled learning to address the issue of potential anomalies in the training data. To evaluate our method, we conduct experiments on a public dataset and two industrial datasets. RTAnomaly outperforms all the baseline models by achieving an average F1 score of 0.929 and Hit@3 of 0.920, demonstrating its superiority.
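The RTAnomaly abstract reports two evaluation numbers, an F1 score for anomaly detection and Hit@3 for localizing the anomalous metric. The sketch below shows how such scores are commonly computed. It assumes the usual definitions (F1 as the harmonic mean of precision and recall; Hit@k as the fraction of cases whose true culprit metric appears in the top k of a ranking), which may differ in detail from the paper's protocol. The function names and sample data are hypothetical.

```python
def f1_score(predicted, actual):
    """F1 over binary anomaly labels (1 = anomalous, 0 = normal)."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def hit_at_k(ranked_metrics_per_case, true_metric_per_case, k=3):
    """Fraction of cases whose true culprit metric is in the top-k ranking."""
    hits = sum(
        1
        for ranked, truth in zip(ranked_metrics_per_case, true_metric_per_case)
        if truth in ranked[:k]
    )
    return hits / len(true_metric_per_case)


# Made-up labels and rankings, for illustration only.
predicted = [1, 1, 0, 0, 1]
actual = [1, 0, 0, 1, 1]
rankings = [["cpu", "mem", "disk", "net"], ["net", "cpu", "mem", "disk"]]
culprits = ["disk", "disk"]

print(f1_score(predicted, actual))
print(hit_at_k(rankings, culprits, k=3))
```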
RAPGen: An Approach for Fixing Code Inefficiencies in Zero-Shot
Garg, Spandan, Moghaddam, Roshanak Zilouchian, Sundaresan, Neel
Performance bugs are non-functional bugs that can even manifest in well-tested commercial products. Fixing these performance bugs is an important yet challenging problem. In this work, we address this challenge and present a new approach called Retrieval-Augmented Prompt Generation (RAPGen). Given a code snippet with a performance issue, RAPGen first retrieves a prompt instruction from a pre-constructed knowledge-base of previous performance bug fixes and then generates a prompt using the retrieved instruction. It then uses this prompt on a Large Language Model (such as Codex) in zero-shot to generate a fix. We compare our approach with various prompt variations and state-of-the-art methods in the task of performance bug fixing. Our evaluation shows that RAPGen can generate performance improvement suggestions equivalent to or better than a developer's in ~60% of the cases, getting ~39% of them verbatim, in an expert-verified dataset of past performance changes made by C# developers.
- North America > United States > Washington > King County > Redmond (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
DeLag: Using Multi-Objective Optimization to Enhance the Detection of Latency Degradation Patterns in Service-based Systems
Traini, Luca, Cortellessa, Vittorio
Performance debugging in production is a fundamental activity in modern service-based systems. The diagnosis of performance issues is often time-consuming, since it requires thorough inspection of large volumes of traces and performance indices. In this paper we present DeLag, a novel automated search-based approach for diagnosing performance issues in service-based systems. DeLag identifies subsets of requests that show, in the combination of their Remote Procedure Call execution times, symptoms of potentially relevant performance issues. We call such symptoms Latency Degradation Patterns. DeLag simultaneously searches for multiple latency degradation patterns while optimizing precision, recall and latency dissimilarity. Experimentation on 700 datasets of requests generated from two microservice-based systems shows that our approach provides better and more stable effectiveness than three state-of-the-art approaches and general purpose machine learning clustering algorithms. DeLag is more effective than all baseline techniques in at least one case study (with p ≤ 0.05 and non-negligible effect size). Moreover, DeLag outperforms in terms of efficiency the second and the third most effective baseline techniques on the largest datasets used in our evaluation (up to 22%).
In order to support this fast-paced release cycle, IT organizations often employ several [...] Unfortunately, frequent software releases often hamper the ability to deliver high quality software [3]. For example, widely used performance assurance techniques, [...] Also, given the complexity of these systems and their workloads [6], it is often unfeasible to proactively detect performance issues in a testing environment [7]. [...] issue, and initial understanding, scoping and localization are among the most time-consuming phases during debugging. [...] service-based systems [9], [10], [11], [12], [13], [14], [15], the reduction of the manual effort and the time needed is still critical. [...] rely on pattern mining to spot patterns in trace attributes (e.g., request size, response size, RPCs execution times) [...]
- North America > United States > New York > New York County > New York City (0.05)
- Europe > Italy > Abruzzo > L'Aquila Province > L'Aquila (0.04)
- North America > United States > West Virginia (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.88)
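The DeLag abstract describes patterns over RPC execution times that are scored by precision and recall. As a rough illustration, the sketch below treats a pattern as a set of per-RPC time intervals and scores it against requests whose end-to-end latency exceeds a threshold. The request layout, the threshold, and the interval form of a pattern are assumptions made for this sketch, not details taken from the abstract. The search procedure and the latency dissimilarity objective are not covered.

```python
def matches(pattern, request):
    """A request matches if every RPC named in the pattern has an
    execution time inside that RPC's [low, high) interval."""
    return all(
        low <= request["rpc_times_ms"][rpc] < high
        for rpc, (low, high) in pattern.items()
    )


def precision_recall(pattern, requests, latency_threshold_ms):
    """Score a pattern against requests slower than the threshold."""
    matched = [r for r in requests if matches(pattern, r)]
    slow = [r for r in requests if r["latency_ms"] > latency_threshold_ms]
    slow_and_matched = [r for r in matched if r["latency_ms"] > latency_threshold_ms]
    precision = len(slow_and_matched) / len(matched) if matched else 0.0
    recall = len(slow_and_matched) / len(slow) if slow else 0.0
    return precision, recall


# Made-up traces, for illustration only.
requests = [
    {"rpc_times_ms": {"auth": 120, "db": 30}, "latency_ms": 160},
    {"rpc_times_ms": {"auth": 110, "db": 25}, "latency_ms": 140},
    {"rpc_times_ms": {"auth": 20, "db": 200}, "latency_ms": 230},
    {"rpc_times_ms": {"auth": 105, "db": 10}, "latency_ms": 118},
]
slow_auth = {"auth": (100, 150)}  # hypothetical pattern: slow "auth" RPC

print(precision_recall(slow_auth, requests, latency_threshold_ms=130))
```

In this toy data the pattern matches three requests, two of which are slow, and misses a third slow request caused by the `db` RPC, which is why a search for multiple patterns at once is useful.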
Scenario-based simulation: Combining HD maps and real-world traffic data - atlatec
If you work in the ADAS/Autonomous Vehicles field, you are probably familiar with HD maps – virtual recreations of real-world roads including their 3D profile, driving rules, inter-connectivity of lanes etc. A lot of these HD maps go into the simulation domain, where car makers and suppliers leverage them to train new ADAS/AV systems or for verification/validation of features from those domains. The reason to use HD maps of real-world roads (rather than just generic, fictional routes created from scratch) is simple: In the end, you want your system to perform in the real world – so you want to optimize for real-world conditions as early as possible, starting in simulation. As we all know, the real world is nothing if not random, and you will encounter many situations you would rarely find in generic data sets. So far, so good: These HD maps can be used to properly train lane-keep assistance or lane-departure warning systems, validate speed limit sign detection and many other systems. However, a map only contains the static features of an environment – what about ADAS/AV features that are supposed to react to other traffic participants?
Amazon launches AWS BugBust to spur adoption of AI-powered coding tools
Software failures are expensive -- and on the rise. An estimated 19% to 23% of software development projects fail, and Standish Group found that "challenged" projects -- i.e., those that fail to meet scope, time, or budget expectations -- account for about 52% of software projects. According to a joint project by Undo and Cambridge Judge Business School, these bugs cost enterprises about $61 billion annually, and around 620 million developer hours are wasted on debugging.
Glitches, long hours and delays: Inside Cyberpunk 2077's disastrous rollout
CD Projekt SA Chief Executive Officer Marcin Iwiński made a public mea culpa this week about the disastrous rollout of the video game Cyberpunk 2077 in December. He took personal responsibility and asked fans not to blame the team. In a somber five-minute video address and accompanying blog post, Iwiński acknowledged the game "did not meet the quality standard we wanted to meet. I and the entire leadership team are deeply sorry for this." Iwiński's apology, the second within a month, was an attempt to restore the Polish company's reputation with scores of fans -- and investors -- who had waited eight years for the game, only to discover it was riddled with bugs and performance issues when it was finally released.
- Europe > Western Europe (0.04)
- Europe > Poland > Masovia Province > Warsaw (0.04)